In [104]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:70% !important; }</style>"))

Objective of this Project: to implement a comprehensive data analysis workflow using Python and database persistence, focusing on the New York City (NYC) and San Francisco Tree Census datasets.¶

Project Owner: Bek Tashpulatov

Table of Contents

  • 1  Data Acquisition
    • 1.1  Data preparation, inspection and sorting
  • 2  Acquisition of SF Tree Census
  • 3  Inspection and Data Wrangling
  • 4  Data Sets Integration, Merge and Analysis
    • 4.1  Analysis of the similarities between two cities
  • 5  Analysis of the NYC Tree Census 2015
    • 5.1  Overview of general tree population by boroughs
    • 5.2  Analysis of the top ten species distribution by boroughs
    • 5.3  Analysis of the diameter size of the top five most common species
    • 5.4  Overview of the overall health of representative sample
    • 5.5  Mapping location of the species base on the coordinates of drawn sample
  • 6  Data persistence into 2 different formats (SQLite, HDF5)
  • 7  Conclusion

Introduction¶

This report aims to demonstrate implementation of a complete data analysis workflow, including but not limited to data acquisition, data wrangling, data integration, data persistence and data analysis. We will use Python and other open source technologies while working with the New York City (NYC) Tree Census 2015 and the San Francisco (SF) Tree Census datasets. The objective of our analysis is to gain better understanding of NYC Tree Census, including tree population by boroughs of NYC, overall health status, visual representation of the location of surveyed data and comparing common species with Tree Census of SF.

Report structure as follows:

  • Data Acquisition: Data gathered from two different sources, including static datasets and web APIs.
  • Exploratory Data Analysis and Data Wrangling: Initial exploratory data analysis and data cleaning to prepare the datasets for analysis.
  • Data Sets Integration and Analysis: In-depth data analysis of created datasets.
  • Queries, Cross-Tabulation, and Visualizations: The use of Python libraries such as Matplotlib, Pandas, NumPy, Plotly, Gmap, Seaborn, and others to create queries, cross-tabulations, and visualizations in analysis.
  • Data Persistence: We store the analyzed data in two different formats, SQLite and HDF5.

For above purposes, we have imported Matplotlib, Pandas, Numpy, Plotly, Gmap, Seaborn and other libraries.

In [105]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import re
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from pandas.plotting import scatter_matrix
import seaborn as sns
from pylab import rcParams
import requests 
from bs4 import BeautifulSoup 
import json
import sqlite3
import datetime as dt
import squarify
import plotly
import plotly.plotly as py
import plotly.offline as offline 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import ipywidgets as widgets
from scipy import special
import joypy
from mapboxgl.utils import *
from mapboxgl.viz import *
from pylab import imread,subplot,imshow,show
pd.set_option('max_columns',None)
pd.set_option('max_rows',None)
In [106]:
%matplotlib inline

Data Acquisition¶

We acquired the NYC Tree Census data from the NYC Open Data web page through their API. The dataset contained roughly 700,000 observations, but API's by default limited allowed to retrieve only 50,000 observations. To fulfill the requirements of our analysis, we manually increased the limit to 300,000 observations.

Based on the objective of the project, we acquired the second dataset by downloading the static SF Tree Census dataset and subsequently uploading it to our Python environment.

According to the NYC Open Data Source, the Tree Census was "conducted by volunteers and staff organized by NYC Parks & Recreation and partner organizations. Tree data collected includes tree species, diameter and perception of health."

  • NYC Open Data Source, retrieved from :https://data.cityofnewyork.us/Environment/2015-Street-Tree-Census-Tree-Data/uvpi-gqnh
  • SF Open Data Source, retrieved from: https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq
In [107]:
response = requests.get("https://data.cityofnewyork.us/resource/uvpi-gqnh.json?$limit=300000&$offset=0") #we set the limit for 300,000 due to the problems with the memory error.
print(response)  #Randomly selected 300K trees for analysis from data set 
result=json.loads(response.content)
results_df = pd.DataFrame.from_records(result) 
results_df.head(5)# Convert to pandas DataFrame
<Response [200]>
Out[107]:
address bbl bin block_id boro_ct borocode boroname brch_light brch_other brch_shoe cb_num census_tract cncldist council_district created_at curb_loc guards health latitude longitude nta nta_name problems root_grate root_other root_stone sidewalk spc_common spc_latin st_assem st_senate state status steward stump_diam tree_dbh tree_id trnk_light trnk_other trunk_wire user_type x_sp y_sp zip_city zipcode
0 108-005 70 AVENUE 4022210001 4052307 348711 4073900 4 Queens No No No 406 739 29 29 2015-08-27T00:00:00.000 OnCurb None Fair 40.72309177 -73.84421522 QN17 Forest Hills None No No No NoDamage red maple Acer rubrum 28 16 New York Alive None 0 3 180683 No No No TreesCount Staff 1027431.148 202756.7687 Forest Hills 11375
1 147-074 7 AVENUE 4044750045 4101931 315986 4097300 4 Queens No No No 407 973 19 19 2015-09-03T00:00:00.000 OnCurb None Fair 40.79411067 -73.81867946 QN49 Whitestone Stones No No Yes Damage pin oak Quercus palustris 27 11 New York Alive None 0 21 200540 No No No TreesCount Staff 1034455.701 228644.8374 Whitestone 11357
2 390 MORGAN AVENUE 3028870001 3338310 218365 3044900 3 Brooklyn No No No 301 449 34 34 2015-09-05T00:00:00.000 OnCurb None Good 40.71758074 -73.9366077 BK90 East Williamsburg None No No No Damage honeylocust Gleditsia triacanthos var. inermis 50 18 New York Alive 1or2 0 3 204026 No No No Volunteer 1001822.831 200716.8913 Brooklyn 11211
3 1027 GRAND STREET 3029250001 3338342 217969 3044900 3 Brooklyn No No No 301 449 34 34 2015-09-05T00:00:00.000 OnCurb None Good 40.71353749 -73.93445616 BK90 East Williamsburg Stones No No Yes Damage honeylocust Gleditsia triacanthos var. inermis 53 18 New York Alive None 0 10 204337 No No No Volunteer 1002420.358 199244.2531 Brooklyn 11211
4 603 6 STREET 3010850052 3025654 223043 3016500 3 Brooklyn No No No 306 165 39 39 2015-08-30T00:00:00.000 OnCurb None Good 40.66677776 -73.97597938 BK37 Park Slope-Gowanus Stones No No Yes Damage American linden Tilia americana 44 21 New York Alive None 0 21 189565 No No No Volunteer 990913.775 182202.426 Brooklyn 11215
In [108]:
results_df.describe()
Out[108]:
address bbl bin block_id boro_ct borocode boroname brch_light brch_other brch_shoe cb_num census_tract cncldist council_district created_at curb_loc guards health latitude longitude nta nta_name problems root_grate root_other root_stone sidewalk spc_common spc_latin st_assem st_senate state status steward stump_diam tree_dbh tree_id trnk_light trnk_other trunk_wire user_type x_sp y_sp zip_city zipcode
count 300000 296319 296319 300000 300000 300000 300000 300000 300000 300000 300000 297565 300000 297565 300000 300000 285601 285600 300000 300000 300000 300000 285583 300000 300000 300000 285601 285602 285602 300000 300000 300000 300000 285601 300000 300000 300000 300000 300000 300000 300000 300000 300000 300000 300000
unique 187703 168591 173905 49950 2030 5 5 2 2 2 59 1290 51 51 220 2 4 3 298478 298515 188 188 190 2 2 2 2 129 129 65 26 1 3 4 92 112 300000 2 2 2 3 299510 299714 48 187
top 2750 VETERANS ROAD WEST 2049050001 2000000 208115 5020801 3 Brooklyn No No No 503 20801 51 51 2015-10-27T00:00:00.000 OnCurb None Good 40.61097709 -74.15686098 BK82 East New York None No No No NoDamage London planetree Platanus x acerifolia 62 24 New York Alive None 0 4 381746 No No No NYC Parks Staff 940697.4014 161910.8114 Brooklyn 10312
freq 84 121 267 248 3504 76380 76380 272592 288690 299811 29041 3343 26601 26492 6769 288584 249292 228272 28 28 8181 8181 189251 298288 286858 242967 205934 36919 36919 25160 48146 300000 285601 211750 292056 27925 1 299560 286365 294626 114377 28 28 76380 11118

Data preparation, inspection and sorting¶

  • Dropped variables not relevant to our analysis
  • Updated dataframe to make sure that relevant columns in appropriate format
In [109]:
df_api = results_df
df_api = df_api.drop(["bbl", "st_assem", "st_senate", "bin", "block_id", "boro_ct","borocode", "brch_light","brch_other", "brch_shoe","cncldist","cb_num","census_tract","council_district","created_at","nta","root_grate","root_other","root_stone","sidewalk","steward","stump_diam","trnk_light","trnk_other","user_type"], axis=1)
In [110]:
df_api.dtypes
Out[110]:
address       object
boroname      object
curb_loc      object
guards        object
health        object
latitude      object
longitude     object
nta_name      object
problems      object
spc_common    object
spc_latin     object
state         object
status        object
tree_dbh      object
tree_id       object
trunk_wire    object
x_sp          object
y_sp          object
zip_city      object
zipcode       object
dtype: object

Review of the columns format for possible conversion to appropriate format.

In [111]:
# Define a list of columns to convert and their respective data types
columns_to_convert = {
    "latitude": float,
    "longitude": float,
    "x_sp": float,
    "y_sp": float,
    "tree_dbh": int,
    "zipcode": int
}

# Loop through the columns and apply the specified data type conversion
for column, data_type in columns_to_convert.items():
    df_api[column] = df_api[column].astype(data_type)

Conversion of the selected columns in to appropriate format for further analysis.

In [112]:
print(df_api.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 20 columns):
address       300000 non-null object
boroname      300000 non-null object
curb_loc      300000 non-null object
guards        285601 non-null object
health        285600 non-null object
latitude      300000 non-null float64
longitude     300000 non-null float64
nta_name      300000 non-null object
problems      285583 non-null object
spc_common    285602 non-null object
spc_latin     285602 non-null object
state         300000 non-null object
status        300000 non-null object
tree_dbh      300000 non-null int32
tree_id       300000 non-null object
trunk_wire    300000 non-null object
x_sp          300000 non-null float64
y_sp          300000 non-null float64
zip_city      300000 non-null object
zipcode       300000 non-null int32
dtypes: float64(4), int32(2), object(14)
memory usage: 43.5+ MB
None

Based on above general info of our data frame, we can conclude that major columns like: coordinates, tree diameter size, zipcode and others were converted in to appropriate format. Some observations have missing values.

Acquisition of SF Tree Census¶

In accordance with the project objectives, we acquired the SF Tree Census dataset, stored in CSV format, which was then uploaded into our Python environment.

To ensure the data's relevance, we removed irrelevant columns and updated the DataFrame, making sure that the remaining columns are in their appropriate formats. The dataset comprises 190,000 observations.

For reference, you can access the dataset at the following link: Street Tree List - https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq

In [113]:
# https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq
results_sf = pd.read_csv('street-tree-list.csv', na_values=['???','None','none','null','NULL','nan','NAN','-'])
In [114]:
results_sf.head(5)
Out[114]:
TreeID qLegalStatus qSpecies qAddress SiteOrder qSiteInfo PlantType qCaretaker qCareAssistant PlantDate DBH PlotSize PermitNotes XCoord YCoord Latitude Longitude Location Fire Prevention Districts Police Districts Supervisor Districts Zip Codes Neighborhoods
0 196949 DPW Maintained Pyrus calleryana :: Ornamental Pear 9 Young Ct 1.0 Sidewalk: Curb side : Cutout Tree Private NaN NaN 16.0 Width 3ft NaN 6.016267e+06 2.096084e+06 37.736362 -122.386202 {'longitude': '-122.38620200123', 'human_addre... 10.0 3.0 8.0 58.0 1.0
1 203422 DPW Maintained Acer rubrum :: Red Maple 9 Yerba Buena Ave 1.0 Sidewalk: Curb side : Yard Tree Private NaN NaN 2.0 Width 4ft NaN 5.993355e+06 2.097295e+06 37.738392 -122.465507 {'longitude': '-122.465506999949', 'human_addr... 1.0 8.0 4.0 59.0 40.0
2 115737 Significant Tree Acer rubrum :: Red Maple 9x Yerba Buena Ave 1.0 Sidewalk: Curb side : Cutout Tree Private NaN 2016-02-24T00:00:00.000 3.0 NaN Permit Number 776557 5.993642e+06 2.097056e+06 37.737752 -122.464496 {'longitude': '-122.46449593033', 'human_addre... 1.0 8.0 4.0 59.0 40.0
3 16476 Permitted Site Eucalyptus nicholii :: Nichol's Willow-Leafed ... 9X Newhall St 10.0 Sidewalk: Curb side : Cutout Tree Private NaN NaN 12.0 3X3 NaN 6.018697e+06 2.097076e+06 37.739219 -122.377869 {'longitude': '-122.377869364283', 'human_addr... 10.0 3.0 8.0 58.0 1.0
4 16473 DPW Maintained Eucalyptus nicholii :: Nichol's Willow-Leafed ... 9X Newhall St 7.0 Sidewalk: Curb side : Cutout Tree DPW NaN NaN 12.0 3X3 NaN 6.018697e+06 2.097076e+06 37.739219 -122.377869 {'longitude': '-122.377869364283', 'human_addr... 10.0 3.0 8.0 58.0 1.0
In [115]:
# Remove irrelevant columns from the DataFrame based on the previous step
columns_to_drop = [
    "qLegalStatus", "SiteOrder", "PlantType", "qCaretaker",
    "qCareAssistant", "PlantDate", "PlotSize", "PermitNotes",
    "XCoord", "YCoord", "Location", "Fire Prevention Districts",
    "Police Districts", "Supervisor Districts", "Zip Codes", "Neighborhoods"
]

test_sf = results_sf.drop(columns_to_drop, axis=1)
In [116]:
test_sf.tail()
Out[116]:
TreeID qSpecies qAddress qSiteInfo DBH Latitude Longitude
191529 48137 Tree(s) :: NaN Sidewalk: Curb side : Cutout NaN NaN NaN
191530 48342 Tree(s) :: NaN Sidewalk: Curb side : Cutout NaN NaN NaN
191531 48276 Tree(s) :: NaN Sidewalk: Curb side : Cutout NaN NaN NaN
191532 46884 Tree(s) :: NaN Sidewalk: Curb side : Cutout NaN NaN NaN
191533 35284 Olea europaea :: Olive Tree NaN Sidewalk: Curb side : Cutout NaN NaN NaN
In [117]:
print(test_sf.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191534 entries, 0 to 191533
Data columns (total 7 columns):
TreeID       191534 non-null int64
qSpecies     191534 non-null object
qAddress     190044 non-null object
qSiteInfo    191534 non-null object
DBH          150299 non-null float64
Latitude     188869 non-null float64
Longitude    188869 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 10.2+ MB
None

Inspection and Data Wrangling¶

Upon inspecting the DataFrame, it was observed that the "Species" column contains both Latin and Common names, separated by "::". Given the non-research nature of this report, it was decided to retain the Common names of the tree species.

During the inspection of the dataset, it was noted that some species have numerous variations under one overarching species, such as "Skeena Cherries," "Bing Cherries," "Lapin Cherries," and others. For the specific purpose of this analysis, it was decided to consolidate these variations into a single common name. For example, all variations of cherries were renamed to "Cherry."

In [118]:
test_sf_info = test_sf.groupby('qSpecies')['TreeID'].nunique().nlargest(35)
test_sf_info
Out[118]:
qSpecies
Tree(s) ::                                                                   11597
Platanus x hispanica :: Sycamore: London Plane                               11499
Metrosideros excelsa :: New Zealand Xmas Tree                                 8704
Lophostemon confertus :: Brisbane Box                                         8519
Pittosporum undulatum :: Victorian Box                                        7097
Tristaniopsis laurina :: Swamp Myrtle                                         7007
Prunus cerasifera :: Cherry Plum                                              6702
Magnolia grandiflora :: Southern Magnolia                                     6272
Arbutus 'Marina' :: Hybrid Strawberry Tree                                    5652
Ficus microcarpa nitida 'Green Gem' :: Indian Laurel Fig Tree 'Green Gem'     5641
Prunus serrulata 'Kwanzan' :: Kwanzan Flowering Cherry                        4028
Acacia melanoxylon :: Blackwood Acacia                                        3956
Maytenus boaria :: Mayten                                                     3895
Olea europaea :: Olive Tree                                                   3682
Corymbia ficifolia :: Red Flowering Gum                                       3565
Callistemon citrinus :: Lemon Bottlebrush                                     3267
Ginkgo biloba :: Maidenhair Tree                                              3192
Pyrus calleryana :: Ornamental Pear                                           2981
Prunus serrulata :: Ornamental Cherry                                         2674
Ulmus parvifolia :: Chinese Elm                                               2348
Eriobotrya deflexa :: Bronze Loquat                                           2345
Pinus radiata :: Monterey Pine                                                2229
Ligustrum lucidum :: Glossy Privet                                            2194
Pyrus kawakamii :: Evergreen Pear                                             1948
::                                                                            1898
Cupressus macrocarpa :: Monterey Cypress                                      1886
Pittosporum crassifolium :: Karo Tree                                         1792
Tristaniopsis laurina 'Elegant' :: Small-leaf Tristania 'Elegant'             1783
Melaleuca quinquenervia :: Cajeput                                            1703
Cordyline australis :: Dracena Palm                                           1624
Ficus nitida :: Laurel Fig                                                    1612
Myoporum laetum :: Myoporum                                                   1600
Liquidambar styraciflua :: American Sweet Gum                                 1514
Ficus retusa nitida :: Banyan Fig                                             1417
Juniperus chinensis :: Juniper                                                1405
Name: TreeID, dtype: int64

As we can observe from the data set, species column (qSpecies) has many unidentified tree names described by "Tree(s) ::" or "::", taking in to the account nature of this exercise, it was decided to drop trees with unidentified names and focus only on common names. To achieve this, an applymap lambda function was applied to select only the common names and drop trees with unidentified names. This process ensures that the analysis will focus on more recognizable and informative data.

In [119]:
# Define a function to extract common names from species entries
def namechange(x):
    if type(x)==str:
        if '::' in x:
            y = re.sub('.+::','',x)
        else:
            y=x
    else:
        y=x
    return y

# Apply the namechange function to the entire DataFrame, focusing on the "qSpecies" column
test_sf = test_sf.applymap(namechange)

# Resources used:
# -GeeksforGeeks: https://www.geeksforgeeks.org/python-pandas-dataframe-applymap/
# - Stack Overflow: https://stackoverflow.com/questions/48819767/pandas-dataframe-apply-lambda-if-else-erro
# - Stack Overflow: https://stackoverflow.com/questions/14029245/putting-an-if-elif-else-statement-on-one-line/14029436
In [120]:
test_sf.head(5)
Out[120]:
TreeID qSpecies qAddress qSiteInfo DBH Latitude Longitude
0 196949 Ornamental Pear 9 Young Ct Sidewalk: Curb side : Cutout 16.0 37.736362 -122.386202
1 203422 Red Maple 9 Yerba Buena Ave Sidewalk: Curb side : Yard 2.0 37.738392 -122.465507
2 115737 Red Maple 9x Yerba Buena Ave Sidewalk: Curb side : Cutout 3.0 37.737752 -122.464496
3 16476 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 12.0 37.739219 -122.377869
4 16473 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 12.0 37.739219 -122.377869
In [121]:
# Replace empty strings in the 'qSpecies' column with NaN
test_sf['qSpecies'].replace('', np.nan, inplace=True)
In [122]:
# Drop rows with NaN values in the 'qSpecies' column
test_sf.dropna(subset=['qSpecies'], inplace=True)
test_sf.head(20)
Out[122]:
TreeID qSpecies qAddress qSiteInfo DBH Latitude Longitude
0 196949 Ornamental Pear 9 Young Ct Sidewalk: Curb side : Cutout 16.0 37.736362 -122.386202
1 203422 Red Maple 9 Yerba Buena Ave Sidewalk: Curb side : Yard 2.0 37.738392 -122.465507
2 115737 Red Maple 9x Yerba Buena Ave Sidewalk: Curb side : Cutout 3.0 37.737752 -122.464496
3 16476 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 12.0 37.739219 -122.377869
4 16473 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 12.0 37.739219 -122.377869
5 16470 Red Ironbark 9X Newhall St Sidewalk: Curb side : Cutout 12.0 37.739219 -122.377869
6 16475 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 18.0 37.739219 -122.377869
7 16478 Flaxleaf Paperbark 9X Newhall St Sidewalk: Curb side : Cutout 6.0 37.739219 -122.377869
8 16472 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 18.0 37.739219 -122.377869
9 16471 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 3.0 37.739219 -122.377869
10 13304 Red Flowering Gum 9X Lyell St Median : Cutout 12.0 37.731771 -122.431676
11 13302 Red Flowering Gum 9X Lyell St Median : Cutout 12.0 37.731771 -122.431676
12 13301 Red Flowering Gum 9X Lyell St Median : Cutout 18.0 37.731771 -122.431676
13 13303 Red Flowering Gum 9X Lyell St Median : Cutout 18.0 37.731771 -122.431676
14 102172 Kwanzan Flowering Cherry 9X Hattie St Sidewalk: Curb side : Cutout 4.0 37.761558 -122.440979
15 102171 Kwanzan Flowering Cherry 9X Hattie St Sidewalk: Curb side : Cutout 4.0 37.761558 -122.440979
16 12977 Mayten 9X Castro St Sidewalk: Curb side : Cutout 4.0 37.769905 -122.436179
17 116748 Swamp Myrtle 9x BLANKEN AVE Sidewalk: Curb side : Cutout NaN 37.712291 -122.401484
18 116746 Swamp Myrtle 9x BLANKEN AVE Sidewalk: Curb side : Cutout NaN 37.712291 -122.401484
19 116747 Swamp Myrtle 9x BLANKEN AVE Sidewalk: Curb side : Cutout NaN 37.712291 -122.401484

Above steps are used to replace empty strings ('') in the "qSpecies" column with NaN (Not a Number) and then remove rows with NaN values from the dataset.

In [123]:
# NaN cleaned dataset renamed
fixed_sf = test_sf
In [124]:
# Define a function to standardize tree species names
def treename(x):
    if type(x)==str:
        if ' Cherry Plum' in x:
            y = 'purple-leaf plum'
        elif 'Cherry' in x:
            y= 'cherry'
        elif 'Sycamore: London Plane' in x:
            y= 'London planetree'
        elif 'Southern Magnolia' in x:
            y= 'southern magnolia'
        else:
            y=x
    else:
        y=x
    return y

# Apply the treename function to the entire DataFrame, standardizing tree species names
fixed_sf = fixed_sf.applymap(treename)

#Resources used:
# - GeeksforGeeks: https://www.geeksforgeeks.org/python-pandas-dataframe-applymap/
# - Stack Overflow: https://stackoverflow.com/questions/48819767/pandas-dataframe-apply-lambda-if-else-erro
# - Stack Overflow: https://stackoverflow.com/questions/14029245/putting-an-if-elif-else-statement-on-one-line/14029436

In this section of the analysis, a function named treename is defined and applied to the DataFrame fixed_sf. The primary objective of this code is to standardize the tree species names by mapping different variations to common names. This step is taken to enhance the ease of further data analysis and improve overall understanding. Considering the non-scientific nature of this report, the decision was made to utilize common names for tree species. This simplification process resolves variations in species names, particularly addressing the issue of high variability in some species, such as 'Cherry,' which is now uniformly represented as 'cherry.'.

In [125]:
fixed_sf.head(10)
Out[125]:
TreeID qSpecies qAddress qSiteInfo DBH Latitude Longitude
0 196949 Ornamental Pear 9 Young Ct Sidewalk: Curb side : Cutout 16.0 37.736362 -122.386202
1 203422 Red Maple 9 Yerba Buena Ave Sidewalk: Curb side : Yard 2.0 37.738392 -122.465507
2 115737 Red Maple 9x Yerba Buena Ave Sidewalk: Curb side : Cutout 3.0 37.737752 -122.464496
3 16476 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 12.0 37.739219 -122.377869
4 16473 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 12.0 37.739219 -122.377869
5 16470 Red Ironbark 9X Newhall St Sidewalk: Curb side : Cutout 12.0 37.739219 -122.377869
6 16475 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 18.0 37.739219 -122.377869
7 16478 Flaxleaf Paperbark 9X Newhall St Sidewalk: Curb side : Cutout 6.0 37.739219 -122.377869
8 16472 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 18.0 37.739219 -122.377869
9 16471 Nichol's Willow-Leafed Peppermint 9X Newhall St Sidewalk: Curb side : Cutout 3.0 37.739219 -122.377869

Data Sets Integration, Merge and Analysis¶

For our project, we picked around ten thousand observations randomly from each dataset. We merged these two datasets and sorted out column names where needed. Our goal was to identify tree species with similar common names in both regions.

In [126]:
# Select a random sample of 10,000 rows from the 'fixed_sf' DataFrame
random_sf = fixed_sf.sample(n=10000)
print(random_sf.head())
        TreeID            qSpecies           qAddress  \
73862   220960   Lemon Bottlebrush       3320 21st St   
133934   20830    Monterey Cypress  1801X Sunset Blvd   
155020  125028    Blackwood Acacia  1454 Jefferson St   
55645   104347    London planetree       417 31st Ave   
138643   52295              Mayten      170 Eureka St   

                           qSiteInfo   DBH   Latitude   Longitude  
73862   Sidewalk: Curb side : Cutout  15.0  37.756928 -122.421665  
133934  Sidewalk: Curb side : Cutout  42.0  37.753398 -122.495116  
155020  Sidewalk: Curb side : Cutout   8.0  37.805625 -122.436531  
55645   Sidewalk: Curb side : Cutout  13.0  37.781347 -122.491355  
138643  Sidewalk: Curb side : Cutout   NaN  37.759565 -122.438248  
In [127]:
# Select a random sample of 10,000 rows from the 'df_api' DataFrame
random_api = df_api.sample(n=10000)
print(random_api.head())
                    address       boroname        curb_loc guards health  \
218816  504 WEST 180 STREET      Manhattan          OnCurb   None   Good   
191468    50-050 193 STREET         Queens          OnCurb   None   Good   
236046  600 HYLAN BOULEVARD  Staten Island  OffsetFromCurb   None   Good   
57450    231-021 128 AVENUE         Queens          OnCurb   None   Good   
279599       35 ROSE AVENUE  Staten Island          OnCurb   None   Fair   

         latitude  longitude                         nta_name problems  \
218816  40.847673 -73.932132         Washington Heights South     None   
191468  40.746803 -73.785647                       Auburndale     None   
236046  40.605185 -74.076844  Grasmere-Arrochar-Ft. Wadsworth     None   
57450   40.683267 -73.733852                        Laurelton     None   
279599  40.574779 -74.120321           New Dorp-Midland Beach   Stones   

               spc_common              spc_latin     state status  tree_dbh  \
218816  littleleaf linden          Tilia cordata  New York  Alive         4   
191468   London planetree  Platanus x acerifolia  New York  Alive        31   
236046              maple                   Acer  New York  Alive        11   
57450             pin oak      Quercus palustris  New York  Alive         8   
279599       Norway maple       Acer platanoides  New York  Alive        13   

       tree_id trunk_wire          x_sp         y_sp             zip_city  \
218816  409240         No  1.003027e+06  248114.7522             New York   
191468  397496         No  1.043644e+06  211429.7232        Fresh Meadows   
236046  432810         No  9.629124e+05  159770.8787        Staten Island   
57450   258271         No  1.058066e+06  188321.0217  Springfield Gardens   
279599  459385         No  9.508247e+05  148707.0389        Staten Island   

        zipcode  
218816    10033  
191468    11365  
236046    10305  
57450     11413  
279599    10306  

Considering the size of the both data frames , for analysis purposes ten thousand observations were randomly selected.

In [128]:
random_sf.describe()
Out[128]:
TreeID DBH Latitude Longitude
count 10000.000000 8266.000000 9919.000000 9919.000000
mean 129445.067300 10.220421 37.763637 -122.440801
std 79260.916592 23.347985 0.192503 0.319581
min 2.000000 0.000000 37.708340 -138.283252
25% 52188.750000 3.000000 37.740248 -122.453560
50% 126025.000000 7.000000 37.760086 -122.430999
75% 204253.000000 12.000000 37.779924 -122.412841
max 258218.000000 1920.000000 47.270219 -122.368426
In [129]:
# Rename columns in the 'random_sf' DataFrame for clarity and consistency
random_sf=random_sf.rename(columns = {'qSpecies':'spc_common'})
In [130]:
random_sf=random_sf.rename(columns = {'Latitude':'Latitude_SF'})
random_sf=random_sf.rename(columns = {'Longitude':'Longitude_SF'})

Columns renamed under same format for further merger of two data sets. Merger conducted based species common name.

In [131]:
# Group the most common tree species ('spc_common') and count the unique 'TreeID' values for each group.
# Then, select the top 10 groups with the largest counts.
ln_sf = random_sf.groupby('spc_common')['TreeID'].nunique().nlargest(10)  
ln_sf

#Resources used:
# - Stack Overflow: https://stackoverflow.com/questions/35364601/group-by-and-find-top-n-value-counts-pandas
Out[131]:
spc_common
London planetree                       660
 New Zealand Xmas Tree                 487
cherry                                 451
 Brisbane Box                          446
 Victorian Box                         411
purple-leaf plum                       388
 Swamp Myrtle                          382
southern magnolia                      374
 Hybrid Strawberry Tree                338
 Indian Laurel Fig Tree 'Green Gem'    310
Name: TreeID, dtype: int64
In [132]:
# Merge of the NYC and SF tree census datasets ('random_sf' and 'random_api' DataFrames) based on the 'spc_common' column using an inner join.
# This combines data where the 'spc_common' values match in both DataFrames.
merged = pd.merge(random_sf, random_api, on=['spc_common'], how='inner')
merged.head()
Out[132]:
TreeID spc_common qAddress qSiteInfo DBH Latitude_SF Longitude_SF address boroname curb_loc guards health latitude longitude nta_name problems spc_latin state status tree_dbh tree_id trunk_wire x_sp y_sp zip_city zipcode
0 104347 London planetree 417 31st Ave Sidewalk: Curb side : Cutout 13.0 37.781347 -122.491355 50-050 193 STREET Queens OnCurb None Good 40.746803 -73.785647 Auburndale None Platanus x acerifolia New York Alive 31 397496 No 1.043644e+06 211429.7232 Fresh Meadows 11365
1 104347 London planetree 417 31st Ave Sidewalk: Curb side : Cutout 13.0 37.781347 -122.491355 25 PINE TERRACE Staten Island OnCurb None Good 40.537586 -74.177937 Annadale-Huguenot-Prince's Bay-Eltingville None Platanus x acerifolia New York Alive 12 380728 No 9.347914e+05 135183.8448 Staten Island 10312
2 104347 London planetree 417 31st Ave Sidewalk: Curb side : Cutout 13.0 37.781347 -122.491355 1445 LONGFELLOW AVENUE Bronx OnCurb None Fair 40.830801 -73.887539 Crotona Park East Stones Platanus x acerifolia New York Alive 13 493987 No 1.015372e+06 241980.5482 Bronx 10459
3 104347 London planetree 417 31st Ave Sidewalk: Curb side : Cutout 13.0 37.781347 -122.491355 1883 VYSE AVENUE Bronx OnCurb None Fair 40.839751 -73.884283 East Tremont None Platanus x acerifolia New York Alive 11 441819 No 1.016269e+06 245242.5014 Bronx 10460
4 104347 London planetree 417 31st Ave Sidewalk: Curb side : Cutout 13.0 37.781347 -122.491355 837 CARROLL STREET Brooklyn OnCurb None Good 40.671808 -73.972644 Park Slope-Gowanus None Platanus x acerifolia New York Alive 18 207864 No 9.918386e+05 184035.3164 Brooklyn 11215
In [133]:
merged.to_csv("mergedDF1.csv", index=False)

Data set merged and saved as a CSV file. Given the size of the full data set from the NYC Tree Census API, it may not be practical to save it in a CSV file for further analysis.

In [134]:
# Group the 'merged' DataFrame by tree species ('spc_common') and count the unique 'tree_id' values for each group.
m_sfny = merged.groupby('spc_common')['tree_id'].nunique()  
m_sfny

#Source used:
#https://datascience.stackexchange.com/questions/33053/how-do-i-compare-columns-in-different-data-frames
Out[134]:
spc_common
London planetree     1246
cherry                441
purple-leaf plum      112
southern magnolia       3
Name: tree_id, dtype: int64
In [135]:
# Convert column names to lowercase for consistency
merged.columns = map(str.lower, merged.columns)

# Transform the 'spc_common' column to lowercase
merged['spc_common'].str.lower()

# Display the first few rows of the DataFrame to check the changes
merged.head()
Out[135]:
treeid spc_common qaddress qsiteinfo dbh latitude_sf longitude_sf address boroname curb_loc guards health latitude longitude nta_name problems spc_latin state status tree_dbh tree_id trunk_wire x_sp y_sp zip_city zipcode
0 104347 London planetree 417 31st Ave Sidewalk: Curb side : Cutout 13.0 37.781347 -122.491355 50-050 193 STREET Queens OnCurb None Good 40.746803 -73.785647 Auburndale None Platanus x acerifolia New York Alive 31 397496 No 1.043644e+06 211429.7232 Fresh Meadows 11365
1 104347 London planetree 417 31st Ave Sidewalk: Curb side : Cutout 13.0 37.781347 -122.491355 25 PINE TERRACE Staten Island OnCurb None Good 40.537586 -74.177937 Annadale-Huguenot-Prince's Bay-Eltingville None Platanus x acerifolia New York Alive 12 380728 No 9.347914e+05 135183.8448 Staten Island 10312
2 104347 London planetree 417 31st Ave Sidewalk: Curb side : Cutout 13.0 37.781347 -122.491355 1445 LONGFELLOW AVENUE Bronx OnCurb None Fair 40.830801 -73.887539 Crotona Park East Stones Platanus x acerifolia New York Alive 13 493987 No 1.015372e+06 241980.5482 Bronx 10459
3 104347 London planetree 417 31st Ave Sidewalk: Curb side : Cutout 13.0 37.781347 -122.491355 1883 VYSE AVENUE Bronx OnCurb None Fair 40.839751 -73.884283 East Tremont None Platanus x acerifolia New York Alive 11 441819 No 1.016269e+06 245242.5014 Bronx 10460
4 104347 London planetree 417 31st Ave Sidewalk: Curb side : Cutout 13.0 37.781347 -122.491355 837 CARROLL STREET Brooklyn OnCurb None Good 40.671808 -73.972644 Park Slope-Gowanus None Platanus x acerifolia New York Alive 18 207864 No 9.918386e+05 184035.3164 Brooklyn 11215
In [136]:
# Create a new datasets (DataFrame 'm_ns') by selecting specific columns from 'merged' dataset and renaming them for clarity.
m_ns = pd.DataFrame(merged, columns = ['spc_common','tree_dbh', 'dbh','tree_id','treeid'])
m_ns=m_ns.rename(columns = {'tree_dbh':'nyc_dbh'})
m_ns=m_ns.rename(columns = {'dbh':'sf_dbh'})
m_ns=m_ns.rename(columns = {'tree_id':'nyc_tree_id'})
m_ns=m_ns.rename(columns = {'treeid':'sf_tree_id'})
In [137]:
# Transform the 'spc_common' column to lowercase for consistency
m_ns['spc_common'] = m_ns['spc_common'].str.lower()
m_ns.dtypes
Out[137]:
spc_common      object
nyc_dbh          int32
sf_dbh         float64
nyc_tree_id     object
sf_tree_id       int64
dtype: object
In [138]:
# Convert the 'nyc_dbh' column to a float data type for precise decimal representation
m_ns["nyc_dbh"] = m_ns.nyc_dbh.astype(float)
# Convert the 'nyc_tree_id' column to an integer data type for whole number representation
m_ns["nyc_tree_id"] = m_ns.nyc_tree_id.astype(int)

Analysis of the similarities between two cities¶

To gain a better understanding of representative samples from both cities, we employed the 'groupby' function and exploratory statistics to analyze common tree species. Our analysis primarily focused on the Tree Diameter at Breast Height (DBH), a standard method of measuring the diameter of a tree trunk or bole, typically 1.3 meters above ground level. This approach allowed us to uncover similarities between the two cities in terms of tree species and their characteristics.

In [139]:
# Group the data in the dataset ('m_ns' DataFrame) by tree species ('spc_common') and calculate statistics on 'sf_dbh' - tree species diameter.
# The statistics include the minimum, maximum, and mean values for each tree species.
m_ns.groupby('spc_common').sf_dbh.agg(['min','max','mean']).round()
Out[139]:
min max mean
spc_common
cherry 1.0 42.0 5.0
london planetree 1.0 1920.0 16.0
purple-leaf plum 0.0 22.0 6.0
southern magnolia 0.0 80.0 7.0

Exploratory statistics on DBH of sample drawn from SF dataset.

In [140]:
m_ns.groupby('spc_common').nyc_dbh.agg(['min','max','mean']).round()
Out[140]:
min max mean
spc_common
cherry 0.0 23.0 5.0
london planetree 2.0 255.0 21.0
purple-leaf plum 1.0 13.0 5.0
southern magnolia 2.0 16.0 7.0

Exploratory statistics on DBH of sample drawn from NYC dataset.

In [141]:
# Group the data in merged dataset by tree species ('spc_common') and calculate the number of unique 'nyc_tree_id' and 'sf_tree_id' (common tree ) values for each species in both cities.
m_ns.groupby('spc_common')['nyc_tree_id','sf_tree_id'].nunique()
Out[141]:
nyc_tree_id sf_tree_id
spc_common
cherry 441 451
london planetree 1246 660
purple-leaf plum 112 388
southern magnolia 3 374

Observation:¶

Upon comparing prevalence of similar species in merged dataset, we observed that the London Plane tree is notably more common in New York City (NYC). It's worth noting that this tree species is so abundant in NYC that it falls under the NYC Parks Department's list of restricted-use species for street tree planting, as it constitutes more than 10% of all street trees in the city.

On the other hand, the Purple Leaf Plum tree, originally from Southeast Europe and known for its adaptability to milder climates, is more populous in San Francisco (SF). This phenomenon is likely attributed to the similarity in climate to its region of origin.

Additionally, when examining the average diameter at breast height, it became evident that London Plane trees in NYC are generally larger. This size difference is likely due to climate conditions, as the London Plane tree is native to the Northern Hemisphere.

In summary, this analysis highlights the different abundances and characteristics of these tree species in NYC and SF, which can be attributed to climate and historical factors.</font>

Please see below for graphical representation of the common species between two data sets.

Source retrieved from : https://en.wikipedia.org/wiki/Platanus_×_acerifolia / https://en.wikipedia.org/wiki/Cherry_plum

In [142]:
# Group the data in the dataset ('m_ns' DataFrame) by tree species ('spc_common') and calculate the number of unique 'nyc_tree_id' and 'sf_tree_id' values for each species for further analysis.
m_plot = m_ns.groupby('spc_common')['nyc_tree_id','sf_tree_id'].nunique()
In [143]:
# Adjust the figure size and font size for the plot
rcParams['figure.figsize'] = 18, 8
rcParams['font.size'] = 25

# Aesthetics: plot style set to 'ggplot'  
plt.style.use('ggplot')

# Bar plot to compare the number of unique tree identifiers for NYC and SF, using different colors for each city
m_plot.loc[:,['nyc_tree_id','sf_tree_id']].plot(kind='bar', color=['#bc8aad', '#F26458'])

plt.title('NYC vs SF trees', alpha=1.75, fontsize=15)
Out[143]:
Text(0.5, 1.0, 'NYC vs SF trees')

Analysis of the NYC Tree Census 2015¶

  • General Overview of Tree Population by Boroughs: An initial look at the distribution of trees across different boroughs.
  • Top Species Distribution by Boroughs: A deeper analysis to identify and understand the prevalence of the most common tree species in each borough.
  • Analysis of Diameter Sizes for the Top Five Common Species: A closer examination of the diameter sizes of the five most frequently occurring tree species.
  • Assessment of Overall Species Health by Boroughs: An evaluation of the overall health status of tree species in various boroughs.
  • Geographic Mapping of Species Locations: Mapping the locations of tree species based on sample coordinates to visualize their distribution.

An overview of the general tree population by NYC boroughs¶

The code below prepares and visualizes data to present a treemap illustrating the tree population across different NYC boroughs.

In [144]:
# Preparing Data: Grouping tree data by borough and calculating counts
df = df_api.groupby('boroname').size().reset_index(name='counts')

# Creating labels for boroughs with counts
labels = df.apply(lambda x: str(x[0]) + "\n (" + str(x[1]) + ")", axis=1)

# Extracting count values and defining colors for the treemap
sizes = df['counts'].values.tolist()
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

# Creating the treemap visualization
plt.figure(figsize=(12,8), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

# References used:
# https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/
# Visualization Type: Treemap
Out[144]:
<matplotlib.axes._subplots.AxesSubplot at 0x20d81fc16d8>

As we can observe from above plot, it's clear that Queens and Brooklyn have the most trees, making up about 51% of the total observed species in our sample. Manhattan, despite being the third-largest borough with over 1.6 million residents, has the fewest tree observations. This might be due to the high density of buildings. Brooklyn is the second most populated with 2.5 million residents, and Queens is third with around 2.3 million residents.

You can find resident data for 2018 in this source: https://www.citypopulation.de/php/usa-newyorkcity.php

In [145]:
# Group the 'df_api' DataFrame by tree species ('spc_common') and count the number of unique 'tree_id' values for each species.
df_aw = df_api.groupby('spc_common')['tree_id'].nunique().nlargest(30)
df_aw
Out[145]:
spc_common
London planetree          36919
honeylocust               27956
Callery pear              27478
pin oak                   21437
Norway maple              15567
cherry                    13077
Japanese zelkova          12781
littleleaf linden         12321
ginkgo                     9241
Sophora                    8733
red maple                  8660
green ash                  6559
American linden            6141
silver maple               5266
sweetgum                   4890
American elm               3579
northern red oak           3506
purple-leaf plum           3349
swamp white oak            2739
Chinese elm                2668
maple                      2611
silver linden              2484
'Schubert' chokecherry     2349
crimson king maple         2304
Japanese tree lilac        2036
golden raintree            1940
eastern redbud             1896
crab apple                 1751
willow oak                 1513
hedge maple                1503
Name: tree_id, dtype: int64

For use of further analysis, the table above represents the quantity of the most common tree species observed in our dataset.

Analysis of the top ten species distribution by boroughs¶

In [146]:
# Filter the dataset ('df_api' DataFrame) to select specific tree species by their common names.
df_top = df_api.loc[df_api['spc_common'].isin(['London planetree','honeylocust','Callery pear','pin oak','Norway maple','cherry','Japanese zelkova','littleleaf linden','ginkgo','Sophora'])]
df_top['spc_common'] = df_top['spc_common'].str.lower()
C:\Users\Avrora\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [147]:
# Create a cross-tabulation (crosstab) of tree species distribution in NYC boroughs using data from the 'df_top' DataFrame.
df_rapi1 = pd.crosstab(df_top.boroname, df_top.spc_common, margins= True)
df_rapi1
Out[147]:
spc_common callery pear cherry ginkgo honeylocust japanese zelkova littleleaf linden london planetree norway maple pin oak sophora All
boroname
Bronx 3395 2801 1759 6495 3066 2565 5193 2496 4280 1756 33806
Brooklyn 3740 2938 2337 6747 3998 3651 14816 3485 5046 2634 49392
Manhattan 3173 270 2810 6178 1580 1497 1789 137 2013 1968 21415
Queens 5174 3956 1780 5951 2678 3522 9082 7042 5812 1877 46874
Staten Island 11996 3112 555 2585 1459 1086 6039 2407 4286 498 34023
All 27478 13077 9241 27956 12781 12321 36919 15567 21437 8733 185510
In [148]:
# Create a heatmap visualization of the tree species distribution in NYC boroughs using the 'df_rapi1' DataFrame.
fig, ax = plt.subplots(figsize=(18,10))
sns.heatmap(df_rapi1, cmap="YlGnBu", annot=True, fmt="d",  ax=ax)
Out[148]:
<matplotlib.axes._subplots.AxesSubplot at 0x20d7ee06da0>

The graph above breaks down the agglomeration of NYC trees by boroughs, focusing on the top 10 tree species and a representative sample of 300 thousand observations. As mentioned before, the London Plane tree is the most common species in our sample, and it's predominantly found in Brooklyn, with nearly 15 thousand recorded observations. Following closely is the Callery Pear, mainly located in Staten Island, with around 12 thousand observations.

Analysis of the diameter size of the top five most common species¶

During the initial review of the dataset, we observed that certain observations had exceptionally large trunk diameters (DBH). To address this, we conducted basic outlier detection procedures and removed trees with a DBH exceeding 60 cm.

In [149]:
df_api.index
Out[149]:
RangeIndex(start=0, stop=300000, step=1)
In [150]:
# Filter the dataset ('df_api' DataFrame) to include only specific tree species of interest (based on heatmap).
df_mean = df_api.loc[df_api['spc_common'].isin(['London planetree','honeylocust','Callery pear','pin oak','Norway maple'])]
In [151]:
df_mean.head(5)
Out[151]:
address boroname curb_loc guards health latitude longitude nta_name problems spc_common spc_latin state status tree_dbh tree_id trunk_wire x_sp y_sp zip_city zipcode
1 147-074 7 AVENUE Queens OnCurb None Fair 40.794111 -73.818679 Whitestone Stones pin oak Quercus palustris New York Alive 21 200540 No 1.034456e+06 228644.8374 Whitestone 11357
2 390 MORGAN AVENUE Brooklyn OnCurb None Good 40.717581 -73.936608 East Williamsburg None honeylocust Gleditsia triacanthos var. inermis New York Alive 3 204026 No 1.001823e+06 200716.8913 Brooklyn 11211
3 1027 GRAND STREET Brooklyn OnCurb None Good 40.713537 -73.934456 East Williamsburg Stones honeylocust Gleditsia triacanthos var. inermis New York Alive 10 204337 No 1.002420e+06 199244.2531 Brooklyn 11211
5 8 COLUMBUS AVENUE Manhattan OnCurb Helpful Good 40.770046 -73.984950 Lincoln Square None honeylocust Gleditsia triacanthos var. inermis New York Alive 11 190422 No 9.884187e+05 219825.5227 New York 10023
6 120 WEST 60 STREET Manhattan OnCurb Helpful Good 40.770210 -73.985338 Lincoln Square None honeylocust Gleditsia triacanthos var. inermis New York Alive 11 190426 No 9.883112e+05 219885.2785 New York 10023
In [152]:
# Create a boxplot to visualize the distribution of tree diameters (DBH) for the selected tree species.
plt.figure(figsize=(18, 6))
sns.boxplot(x=df_mean['tree_dbh'])
Out[152]:
<matplotlib.axes._subplots.AxesSubplot at 0x20d074d4eb8>

As we can see from the graphical representation of the outliers based on tree diameter at breast height (DBH), the population mean is approximately 13 cm. The majority of the population, which falls within the interquartile range (between Q1 and Q3), is concentrated between 8 and 20 cm in diameter. This indicates the typical range for tree diameters within the dataset.

In [153]:
# Create a scatter plot to explore the relationship between tree species (SPC Common name) and their sizes (Tree Size). 
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(df_mean['spc_common'], df_mean['tree_dbh'])
ax.set_xlabel('SPC Common name')
ax.set_ylabel('Tree Size')
plt.show()

#This scatter plot helps us examine the distribution of tree sizes for different species, allowing us to observe any patterns or variations in tree size based on their common names.
In [154]:
# Remove tree observations with a DBH of 60 cm or greater from the 'df_mean' DataFrame.
df_mean.drop(df_mean[ df_mean['tree_dbh'] >= 60].index , inplace=True)
df_mean.describe()
C:\Users\Avrora\Anaconda3\lib\site-packages\pandas\core\frame.py:3697: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Out[154]:
latitude longitude tree_dbh x_sp y_sp zipcode
count 129319.000000 129319.000000 129319.00000 1.293190e+05 129319.000000 129319.000000
mean 40.696850 -73.946267 14.58699 9.991154e+05 193188.791780 10804.477749
std 0.100946 0.129126 8.65722 3.582568e+04 36775.042434 610.274034
min 40.498510 -74.254432 0.00000 9.134969e+05 120989.630100 83.000000
25% 40.613574 -73.994756 8.00000 9.857030e+05 162829.314000 10312.000000
50% 40.692852 -73.930817 13.00000 1.003426e+06 191746.680200 11101.000000
75% 40.768642 -73.858428 20.00000 1.023457e+06 219343.937550 11235.000000
max 40.910883 -73.700488 59.00000 1.067248e+06 271154.286500 11694.000000

Removing outliers with a DBH greater than 60 cm results in a more representative and improved distribution of the tree population's DBH sizes. This helps ensure that extreme values do not disproportionately impact the analysis, allowing for a more accurate understanding of the typical tree sizes in the dataset.

In [155]:
# Create a scatter plot to explore the relationship between tree species (SPC Common name) and their sizes (Tree Size) with outliers removed.
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(df_mean['spc_common'], df_mean['tree_dbh'])
ax.set_xlabel('SPC Common name')
ax.set_ylabel('Tree Size')
plt.show()
In [156]:
#Considering the nature of the date used violin plot to visualize the distribution of tree sizes for the five most common species, with some specific visual settings.
sns.set(style="whitegrid")
df = df_mean

# Create the violin plot with customized visual settings.
f, ax = plt.subplots(figsize=(18, 12))
sns.violinplot(x="spc_common",y="tree_dbh",data=df, inner="box", palette="Greens",  bw =.3, cut=.5, linewidth=3)

sns.despine(left=True)

# Set the plot title and axis labels.
f.suptitle('Tree Size Distribution (5 most common)', fontsize=14, fontweight='bold')
ax.set_xlabel("Tree's Common Name",size = 16,alpha=0.7)
ax.set_ylabel("Tree diameter measured at 4.5 feet above the ground ",size = 16,alpha=0.7)

# Resourses Used:
# https://mode.com/blog/violin-plot-examples
C:\Users\Avrora\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning:

Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.

Out[156]:
Text(0, 0.5, 'Tree diameter measured at 4.5 feet above the ground ')

Based on the information from the provided resources, the London plane tree is the largest among the representative sample. It can grow to a height of 20–30 meters (66–98 feet) and exceptionally even taller, with a trunk that can reach a circumference of 3 meters (10 feet) or more. On the other hand, the Callery Pear, also known as Pyrus calleryana, is the smallest, typically reaching a height of 5 to 8 meters (16 to 26 feet) with a conical to rounded crown.

The graph above depicts the overall diameter at breast height (DBH) size distribution of the top five species. It aligns with the information from the resources, highlighting that the London plane tree is the largest and tallest species in our representative sample. </font>

Resources used:

  1. Pyrus calleryana - https://en.wikipedia.org/wiki/Pyrus_calleryana
  2. Platanus × acerifolia - https://en.wikipedia.org/wiki/Platanus_×_acerifolia

An overview of the overall health of tree species in the representative sample, specifically by borough.¶

In [157]:
#df_api.groupby('boroname')['health'].unique()
#df_api.groupby('boroname')['health'].count()

# Group the 'df_api' DataFrame by borough and health, then count the occurrences of each combination.
df_h = df_api.groupby(['boroname', 'health']).size().reset_index(name='counts')
df_h
Out[157]:
boroname health counts
0 Bronx Fair 6847
1 Bronx Good 44362
2 Bronx Poor 2163
3 Brooklyn Fair 11053
4 Brooklyn Good 58229
5 Brooklyn Poor 3528
6 Manhattan Fair 5361
7 Manhattan Good 21181
8 Manhattan Poor 2039
9 Queens Fair 12920
10 Queens Good 54994
11 Queens Poor 4143
12 Staten Island Fair 7193
13 Staten Island Good 49506
14 Staten Island Poor 2081
In [158]:
df_h.dtypes
Out[158]:
boroname    object
health      object
counts       int64
dtype: object
In [159]:
# Set Plotly credentials with your username and API key.
plotly.tools.set_credentials_file(username='XXX', api_key='XXX')
# Initialize Plotly for notebook usage.
offline.init_notebook_mode(connected=True)
In [160]:
# Import the required modules from Plotly.
import plotly.offline as pyo
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
In [161]:
df_zz = pd.crosstab(df_api.boroname, df_api.health, margins= True)
df_zz.dtypes
Out[161]:
health
Fair    int64
Good    int64
Poor    int64
All     int64
dtype: object
In [162]:
#df_zz = df_zz.drop(columns="All")
#df_zz = df_zz.drop("All", axis=0)
df_plotly = df_zz
df_plotly
Out[162]:
health Fair Good Poor All
boroname
Bronx 6847 44362 2163 53372
Brooklyn 11053 58229 3528 72810
Manhattan 5361 21181 2039 28581
Queens 12920 54994 4143 72057
Staten Island 7193 49506 2081 58780
All 43374 228272 13954 285600
In [163]:
df_plotly = df_plotly.drop("All", axis=0)
In [164]:
# Define three Bar traces for 'Good', 'Fair', and 'Poor' health categories.
trace0 = go.Bar(
    x = df_plotly.index,
    y = df_plotly.Good,
    marker=dict(
        color='#3ACB5B'),
    name='Good'
)

trace1 = go.Bar(
    x = df_plotly.index,
    y = df_plotly.Fair,
        marker=dict(
        color='#609E64'),
    name='Fair'
)

trace2 = go.Bar(
    x = df_plotly.index,
    y = df_plotly.Poor,
        marker=dict(
        color='#B97D66'),
    name='Poor'
)

# Create Scatter traces for additional data visualization.
traces= [go.Scatter(
    x = df_plotly.columns,
    y = df_plotly.loc[rowname],
    mode = 'markers+lines',
    name = 'xxx'
) for rowname in df_plotly.index]

# Define the data and layout for the Plotly figure
data = [trace0, trace1, trace2]
layout = go.Layout(
    barmode='stack',  title='Trees Health Status by Borough')
fig = go.Figure(data=data, layout=layout)

# Display the figure using Plotly.
pyo.offline.iplot(fig, filename='stacked-bar')

#Source retrieved from: https://plot.ly/python/bar-charts/

The graph above provides an overview of the assumed health status of the tree population. It was generated using the Plotly library and includes dynamic columns. Based on the summary output of the cross-tabulation function, we can conclude that approximately 80% or 228,000 trees are in presumed 'Good Health' condition. The representative sample consists of 300,000 observations, and this conclusion is drawn from 285,000 observations, as the rest were dropped due to missing data.

Mapping location of the species base on the coordinates of drawn sample¶

In [165]:
# Importing the Gmaps Library and Datasets
import gmaps 
import gmaps.datasets 
In [166]:
# Randomly sample 10,000 rows from the 'df_api' DataFrame and display the first few rows.

random_api = df_api.sample(n=10000)
print(random_api.head())
                      address       boroname curb_loc guards health  \
209309    1337 EAST 29 STREET       Brooklyn   OnCurb   None   Good   
16903   300 WEST SERVICE ROAD  Staten Island   OnCurb   None   Poor   
115962       2575 FISH AVENUE          Bronx   OnCurb    NaN    NaN   
294086        74 ADLAI CIRCLE  Staten Island   OnCurb   None   Good   
117097          373 11 STREET       Brooklyn   OnCurb   None   Good   

         latitude  longitude                                    nta_name  \
209309  40.618539 -73.946270                                   Flatlands   
16903   40.585771 -74.193796           New Springville-Bloomfield-Travis   
115962  40.864909 -73.847263                     Allerton-Pelham Gardens   
294086  40.544407 -74.173142  Annadale-Huguenot-Prince's Bay-Eltingville   
117097  40.667320 -73.985843                          Park Slope-Gowanus   

            problems         spc_common           spc_latin     state status  \
209309          None   purple-leaf plum   Prunus cerasifera  New York  Alive   
16903   BranchLights  littleleaf linden       Tilia cordata  New York  Alive   
115962           NaN                NaN                 NaN  New York  Stump   
294086          None            pin oak   Quercus palustris  New York  Alive   
117097          None       sawtooth oak  Quercus acutissima  New York  Alive   

        tree_dbh tree_id trunk_wire          x_sp         y_sp       zip_city  \
209309         8  418697         No  9.991665e+05  164631.2653       Brooklyn   
16903          8  197701         No  9.304220e+05  152748.1578  Staten Island   
115962         0  330277         No  1.026496e+06  254424.0739          Bronx   
294086        17  507626         No  9.361291e+05  137666.1968  Staten Island   
117097         3  299202         No  9.881774e+05  182399.2681       Brooklyn   

        zipcode  
209309    11210  
16903     10314  
115962    10469  
294086    10312  
117097    11215  
In [167]:
# Configure gmaps with your personal API key
gmaps.configure("AIzaSyANVf8cXZ3KusFgBAnLuiaQcO6i9eVN3ZU") #personal key

# Create a hybrid map centered at specific coordinates with custom layout settings
fig = gmaps.figure(map_type="HYBRID", center=(40.706742, -73.816493), zoom_level=10,
                  layout={
        'width': '1200px',
        'height': '600px',
        'padding': '3px',
        'border': '1px solid black'
})

# Retrieve location data from the 'df_api' DataFrame
locations = df_api[['latitude', 'longitude']]

# Create a heatmap layer with customized settings
heatmap = gmaps.heatmap_layer(locations)
heatmap.max_intensity = 2
heatmap.point_radius = 2
heatmap.gradient = ['white', 'green']

# Add the heatmap layer to the map
fig.add_layer(heatmap)
fig
#Data retreived: https://jupyter-gmaps.readthedocs.io/en/latest/api.html
Figure(layout=FigureLayout(border='1px solid black', height='600px', padding='3px', width='1200px'))

Please make sure to place html file and included MapLocation.png in to one folder before you open html file. Thank you.

In [9]:
from IPython.display import Image
image_path = "C:\\dev\\a\\notebooks\\MapLocations.png"

# Display the image
Image(filename=image_path)
Out[9]:

To plot the coordinates of NYC tree census data, two different techniques were chosen: Google Maps (gmaps) and Seaborn libraries. When using the Gmap library, the map can only be visualized at runtime, and the only way to save the output as an HTML file is to take a screenshot and upload it into the Python environment, which was done in this case. Additionally, a Seaborn plot was created based on a sample of 10 thousand observations, providing an accurate representation of the five boroughs of NYC.

In [168]:
tree_df = df_api

# Create a scatterplot to visualize tree locations based on coordinates
plt.figure(figsize=[20,10])
sns.scatterplot(x='longitude', y='latitude', palette="YlGnBu", data=tree_df.sample(10000))

# Plot title
plt.title('Tree location by coordinates in seaborn');

#retrieved from:  https://seaborn.pydata.org/introduction.html

Data Persistence: We store the analyzed data in two different formats, SQLite and HDF5.¶

In [169]:
# Create a database in RAM
connection = sqlite3.connect(':memory:')
# Creates or opens a file called mySQLiteDB.sl3 with a SQLite3 DB
connection = sqlite3.connect('../datasets/mySQLiteDB.sl3')
cursor = connection.cursor()
In [170]:
cursor.execute("DROP TABLE IF EXISTS df_api")
df_api.to_sql(name='df_api', con=connection) 
#Code Retrieved from: https://stackoverflow.com/questions/14431646/how-to-write-pandas-dataframe-to-sqlite-with-index
In [ ]:
# Test Connection
df_sql = pd.read_sql('select spc_common from df_api', connection)
In [172]:
# Valid connection results
df_sql.head(10)
Out[172]:
spc_common
0 red maple
1 pin oak
2 honeylocust
3 honeylocust
4 American linden
5 honeylocust
6 honeylocust
7 American linden
8 honeylocust
9 London planetree
In [173]:
# Writing and storing dataset to an HDF5 file.
df_plotly.to_hdf('data.h5', key='df_plotly', mode='w')

Above exercise demonstrates data persistence skills in two formats, SQLite and HDF5.

Conclusion¶

The datasets we've worked with in this project offer a great chance to showcase a comprehensive data analysis workflow, covering data acquisition, data cleaning, data merging, data storage, and data analysis, all using Python. The variety of variables in the dataset allows for insights from multiple perspectives.

I truly admire, how much work has gone into collecting and maintaining this data. Given the ongoing nature of the Tree Census project, our analysis can serve as an interesting starting point for anyone looking to explore and understand New York City's tree population.